In this unit, we’re going to learn how to make graphs. The urge to present data in a pictorial format is an ancient one, and you are sure to find a primordial satisfaction in learning how to do so effectively.

Why Present Data Visually?

What are the benefits of displaying data visually?

  • Your audience can understand it, and quickly. A picture is attention-grabbing, and can be much more concise than text - it is worth a thousand words, they say.
  • You can understand it. The act of deciding what information needs to be presented and how to most effectively present it forces you to understand your data better.
  • Aforementioned primordial satisfaction.

Variable Basics Review

Let’s take a moment to review what we know about variables. This will prove beneficial later.

  1. Give an example of a ‘continuous’ variable.
  2. How would a continuous variable be represented in R (number, factor, string)?
  3. Give an example of a ‘categorical’ variable.
  4. How would a categorical variable be represented in R (number, factor, string)?

Types of Visualizations

There are many ways to present data visually, including but not limited to:

  • Histogram
  • Density Plot
  • Scatter Plot
  • Line Graph
  • Bar Graph
  • Box Plot
  • Violin Plot
  • Bubble Plot
  • Heat Maps

Histogram

A histogram has:

  • x-axis: a continuous variable that you want to describe
  • y-axis: a count of the number of data points that fall in each bin (range of x-axis values)

It is used to:

  • visualize a single variable
  • see which values of a variable are most common.
Histogram (From Wikimedia)
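To make the y-axis concrete, here is a minimal sketch in base R of the counting that sits behind a histogram. The heights vector and the bin edges are made-up values, used only for illustration.

```r
# Hypothetical measurements (made-up values for illustration)
heights <- c(150, 152, 155, 161, 163, 164, 168, 170, 171, 179)

# Cut the x-axis into bins of width 10, then count the points in each bin
bins <- cut(heights, breaks = seq(150, 180, by = 10), include.lowest = TRUE)
bin_counts <- table(bins)
print(bin_counts)
```

Each entry of bin_counts would become the height of one bar.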

Density Plot

A density plot has:

  • x-axis: a continuous variable that you want to describe
  • y-axis: the estimated probability density, i.e. how likely values near each point on the x-axis are

It is used to:

  • visualize a single variable
  • see which values of a variable are most likely.
Density Plot (From Wikimedia)
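One detail worth seeing once: the height of a density curve is not a literal probability. What adds up to 1 is the area under the curve. Here is a rough sketch using base R's density() function; the sample x is made up for illustration.

```r
# Hypothetical data: a small made-up sample
x <- c(2.1, 2.4, 2.9, 3.0, 3.2, 3.3, 3.8, 4.1, 4.6, 5.5)

d <- density(x)            # kernel density estimate (base R)
step <- diff(d$x)[1]       # the grid of x values is evenly spaced
area <- sum(d$y) * step    # approximate area under the curve
print(round(area, 2))
```

The tallest part of the curve still marks the most likely region, which is how we will read these plots.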

Histograms & Density Plots

Histograms and Density plots are highly compatible and convey similar information.

They can be overlaid to make a nice graph.


Scatterplot

A scatterplot has:

  • x-axis: a continuous variable (usually a predictor variable)
  • y-axis: a continuous variable (usually the dependent variable)

It is used to:

  • visualize a relationship between 2 continuous variables
  • see how changes in one variable are related to changes in another.
Scatterplot (From sthda.com)

A scatterplot also often includes:

  • a trendline: a line that shows the trend of the relationship

This line might be:

  • a free-flowing line that shows the overall trend in the data (unlimited degrees of freedom)
  • a line or simple curve that reflects the results of a statistical analysis (limited degrees of freedom)
Scatterplot with trendline (From sthda.com)
Scatterplot with linear trendline (From sthda.com)

Line Graph

A line graph has:

  • x-axis: A continuous variable (usually representing time or some other, similar variable)
  • y-axis: a continuous variable (usually the dependent variable)

It is used to:

  • visualize how one variable changes across time (or across some other continuous variable)
Line Chart (From r-graph-gallery.com)

Bar Graph

A bar graph has:

  • x-axis: 1 or more categorical variables (usually a predictor variable)
  • y-axis: a continuous variable (usually the dependent variable)

It is used to:

  • visualize a relationship between categorical variable(s) and a continuous variable
  • show differences between groups
Bar Chart (From sthda.com)

A bar graph often also has:

  • error bars: whiskers on each bar

Error bars are used to:

  • express the uncertainty of the measurement (“the true mean is somewhere in here”)
  • suggest statistical significance (as a rough rule of thumb, if the error bars of two conditions overlap, the conditions are probably not statistically different)
Bar Chart with Error Bars (From sthda.com)

Box Plot

A box plot has:

  • x-axis: a categorical grouping variable (usually a predictor variable)
  • y-axis: a continuous variable (usually the dependent variable)

These axes can be switched.

What does a box-plot show?

  • The horizontal line shows the median value for each group.
  • The box encompasses the 25th to 75th percentile (i.e. the box represents 50% of the data points)
  • The vertical lines extend to the minimum and max for the data (excluding outliers)
  • The dots show outliers
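The pieces listed above can be computed directly. This is a small sketch using base R's boxplot.stats() on made-up scores; note that ggplot2 computes the box edges slightly differently for some samples, so treat this as an illustration of the idea rather than ggplot2's exact recipe.

```r
# Hypothetical scores with one extreme value (made-up numbers)
scores <- c(4, 5, 5, 6, 7, 7, 8, 9, 10, 25)

parts <- boxplot.stats(scores)
parts$stats  # lower whisker, bottom of box, median, top of box, upper whisker
parts$out    # values that would be drawn as separate dots (outliers)
```

Here the upper whisker stops at 10, the largest value that is not an outlier, and 25 is plotted on its own.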

A box plot is used to:

  • show medians and variability for multiple groups
  • show differences between groups
Box Plot (From r-graph-gallery.com)

Violin Plot

A violin plot is a variation on the box plot.

Instead of using boxes and lines, it simply shows a sideways density plot for each group.

Violin Plot (From r-graph-gallery.com)

Violin Plot & Box Plot

A violin plot and box plot can be combined for an extra informative (and classy) graph.

Violin Plot with Box Plot (From r-graph-gallery.com)

Bubble Plot

A bubble plot takes a variety of forms.

It is usually either:

  • a simple grid (see below) where both axes are categorical variables
  • some version of a scatterplot, where both axes are continuous.

The key feature of a bubble plot is that each point is scaled to reflect the value of some third variable

It is used to:

  • show how some third continuous variable interacts with 2 other variables
Bubble Plot (From jkzorz.github.io)

Bubble Plot (From r-graph-gallery.com)

Heat Map

A heat map is another way to represent a third variable.

The difference is that the bubble plot uses size to represent some third variable’s relationship to two variables, while a heat map uses color. The color usually scales from ‘cool’ colors (blues) to ‘hot’ colors (reds) - hence the name heat map.

Like a bubble plot, a heat map is usually either:

  • a simple grid where both axes are categorical variables
  • some version of a scatterplot, where both axes are continuous.
Heatmap showing eye tracking data (From i-insider.com)

Some other examples can be found here: (https://r-graph-gallery.com/heatmap.html)



Different Types of Graphs (Checklist items 5-10)

Identify each of the graphs below. What kind is it?

Graph 5
Easy, right?
Graph 6
This one’s easy
Graph 7
Here’s Another
Graph 8
I think this might be a combination of two plot types
Graph 9
Now this one
Graph 10
How’s YOUR consumer confidence

All example graphs from r-graph-gallery.com

And Many More

  1. Find a graph on the internet that is NOT of the same type as the ones we’ve discussed. Include a link, and describe when you think a graph of this type would be handy.


When to use different graphs

  1. In R Markdown, make a table like the one below. But don’t leave it empty! Categorize the following types of graphs by placing them in the correct column to complete the table:
  • Line Graph
  • Heat Map
  • Density Plot
  • Bubble Plot
  • Histogram
  • Violin Plot
  • Box Plot
  • Bar Graph
  • Scatterplot
Number and Types of Variables for Different Graphs
| One continuous | Two continuous | One categorical, one continuous | Three, at least one continuous |
|---|---|---|---|
| ? | ? | ? | ? |

Picking a graph based on data

  1. Consider the example data below. What type of graph would you want to use here?
Participant Condition Observation
1 A 5
2 A 2
3 A 7
4 A 4
5 A 1
6 B 10
7 B 4
8 B 10
9 B 5
10 B 9
  1. Now consider this data. What type of graph might you want to use here?
Musher Checkpoint Time
Jerry Sousa 1 243
Jerry Sousa 2 176
Jerry Sousa 3 304
Jerry Sousa 4 201
Melissa Owens 1 215
Melissa Owens 2 421
Melissa Owens 3 334
Melissa Owens 4 220

For the next three questions, we will look at the mtcars data set, which comes pre-loaded in R. To learn more about this data set, check the help page (?mtcars)

library(dplyr) # provides select() and the %>% pipe
select(mtcars, c(1:6)) %>% # We'll just look at the first 6 columns
  knitr::kable() 
mpg cyl disp hp drat wt
Mazda RX4 21.0 6 160.0 110 3.90 2.620
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875
Datsun 710 22.8 4 108.0 93 3.85 2.320
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440
Valiant 18.1 6 225.0 105 2.76 3.460
Duster 360 14.3 8 360.0 245 3.21 3.570
Merc 240D 24.4 4 146.7 62 3.69 3.190
Merc 230 22.8 4 140.8 95 3.92 3.150
Merc 280 19.2 6 167.6 123 3.92 3.440
Merc 280C 17.8 6 167.6 123 3.92 3.440
Merc 450SE 16.4 8 275.8 180 3.07 4.070
Merc 450SL 17.3 8 275.8 180 3.07 3.730
Merc 450SLC 15.2 8 275.8 180 3.07 3.780
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250
Lincoln Continental 10.4 8 460.0 215 3.00 5.424
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345
Fiat 128 32.4 4 78.7 66 4.08 2.200
Honda Civic 30.4 4 75.7 52 4.93 1.615
Toyota Corolla 33.9 4 71.1 65 4.22 1.835
Toyota Corona 21.5 4 120.1 97 3.70 2.465
Dodge Challenger 15.5 8 318.0 150 2.76 3.520
AMC Javelin 15.2 8 304.0 150 3.15 3.435
Camaro Z28 13.3 8 350.0 245 3.73 3.840
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845
Fiat X1-9 27.3 4 79.0 66 4.08 1.935
Porsche 914-2 26.0 4 120.3 91 4.43 2.140
Lotus Europa 30.4 4 95.1 113 3.77 1.513
Ford Pantera L 15.8 8 351.0 264 4.22 3.170
Ferrari Dino 19.7 6 145.0 175 3.62 2.770
Maserati Bora 15.0 8 301.0 335 3.54 3.570
Volvo 142E 21.4 4 121.0 109 4.11 2.780
  1. If I wanted to visualize the relationship between mpg (Miles per gallon) and wt (Weight), what sort of graph should I make?
  2. If I wanted to visualize the difference in hp (horsepower) between all the different Merc models, what would be the best graph to use?
  3. If I wanted to visualize the single variable cyl (number of cylinders) to see which values were common and which uncommon, what would be the best graph to use?

Basics of ggplot

Every time you create a plot with ggplot, follow these 7 steps:

  1. Call the ggplot function and specify what data to use
  2. Add axis ‘aesthetics’ using aes() to lay out the plot area
  3. Add a plot layer
  4. Customize the plot layer
  5. Add more layers as needed
  6. Finalize the plot
  7. Save the plot

Step 1: Call the ggplot function and specify what data to use

ggplot(data = diamonds)

Don’t bother to run this yet; you’ll only get an empty plot area.

PRO TIP: In R Markdown, create a new code chunk for each plot you make and keep other code out of these chunks as much as possible. This way, you can easily move plots around within your document and customize them as desired.

Step 2: Add axis ‘aesthetics’ using aes() to lay out the plot area

ggplot(data = diamonds, aes(x = carat, y = price))

Now we’ve specified what the x and y axis should be, and ggplot has laid it out for us. Next, we need to put something on the graph.


Step 3: Add a plot layer

Now that we have the ‘foundation’ of our plot, we can add layers, using +

The type of layer(s) you add will determine what kind of plot you make.

ggplot(data = diamonds, aes(x = carat, y = price)) + 
  geom_point()

Notice how I used the + to add a new piece to this command. The ggplot package is unique in stringing together functions in this way.

The geom_point() command adds a layer of points to our plot. There are lots of geoms you can add. We’ll see a few of the most useful ones later.

Step 4: Customize a plot layer.

The real fun of ggplot is the ability to customize your plots - to make them look exactly how you want. Struggling to decide between a career as a data scientist or as an artist? Making plots with ggplot lets you do both!

Different geoms can take different arguments, called ‘aesthetics’, that can be customized. Some common ones are size, color, shape, and fill.

The help page for each geom function tells you what aesthetics are available for it.

ggplot(data = diamonds, aes(x = carat, y = price)) + 
  geom_point(size = 2, 
             color = "blue", 
             shape = 1, 
             stroke = 1)

Understanding aes()

In ggplot, we can specify aesthetics in 2 different ways:

Outside aes()

This is what we do when we want some aesthetic property of our graph to be constant. The previous example used this method, making all of the points blue circles that are size 2.

Inside aes()

This is what we do when we want some property of the graph to vary based on some variable in our data. For example, if we want the color of the points to vary depending on the clarity of the diamonds, and the shape of the points to vary based on the cut of the diamonds, we put these arguments inside aes():

ggplot(data = diamonds, aes(x = carat, y = price)) + 
  geom_point(size = 2, 
             aes(color = clarity, shape = cut))

Notice that the dots come in different colors and shapes now. Also, there are now legends that help us interpret these colors and shapes. Finally, notice that size = 2 is NOT inside aes(), because we want this to be constant, applying to ALL points in the same way.
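Here is a small sketch of the inside/outside distinction, including a slip that catches almost everyone at least once. I'm using mtcars (which ships with R) instead of diamonds so it runs quickly; layer_data() is a ggplot2 helper that shows the values a layer actually ends up drawing.

```r
library(ggplot2)

# Outside aes(): every point is literally blue, and no legend is drawn
constant_color <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue")

# Inside aes(): color varies with the data (as.factor() makes cyl categorical)
mapped_color <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = as.factor(cyl)))

# Common slip: "blue" inside aes() is treated as a variable with one level,
# so ggplot picks its own color and adds a legend entry labeled "blue"
slip <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = "blue"))

unique(layer_data(constant_color)$colour)  # the color we asked for
unique(layer_data(slip)$colour)            # a default palette color instead
```

If your points come out an unexpected color with a strange legend, check whether a constant has wandered inside aes().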

Your Turn

Consider step 2 again:

ggplot(data = diamonds, aes(x = carat, y = price))

Why are the x and y axes defined inside an aes()?


Step 5: Add more layers as needed

We don’t have to stop at just 1 layer. Let’s add a trendline.

ggplot(data = diamonds, aes(x = carat, y = price)) + 
  geom_point(size = 2, 
             aes(color = clarity, shape = cut)) +
  geom_smooth(color = "black", fill = "white")

Each new layer is added on top of the previous layer - notice how the line covers the points. If we switch the order, we can move layers up or down.

ggplot(data = diamonds, aes(x = carat, y = price)) + 
  geom_smooth(color = "black", fill = "white") +
  geom_point(size = 2, 
             aes(color = clarity, shape = cut))

But it’s hard to see the line this way, so I think I liked the first order better.

Step 6: Finalize the plot

Now that the plot looks the way we want, we can add some further customization.

Let’s start with 2 things:

  • changing the theme.
  • adding/changing labels.
ggplot(data = diamonds, aes(x = carat, y = price)) + 
  geom_point(size = 2, 
             aes(color = clarity, shape = cut)) +
  geom_smooth(color = "black", fill = "white") + 
  theme_bw() + 
  labs(
    title = "Diamond Plot",
    subtitle = "in case you are interested",
    caption = 
      "Fig. 1. Some relevant data about a girl\'s best friend",
    x = "How Many Carats?",
    y = "Price (in $)",
    tag = NULL, 
    #Useful if your figure is part of a larger multi-panel figure
    alt = "Oops, the plot is missing" 
    # Alt text for websites when the plot doesn't load
  )

Now let’s briefly discuss theme(). You can use this to further customize your graph in almost any way you want, including:

  • modifying the plot area
  • modifying the axis text, titles, and ticks
  • changing font styles and sizes
  • modifying the legend’s look, position, size, etc.
  • and more!

Here’s a simple example, mostly focused on customizing the legend:

ggplot(data = diamonds, aes(x = carat, y = price)) + 
  geom_point(size = 2, 
             aes(color = clarity, shape = cut)) +
  geom_smooth(color = "black", fill = "white") + 
  theme_bw() + 
  labs(
    title = "Diamond Plot",
    subtitle = "in case you are interested",
    caption = 
      "Fig. 1. Some relevant data about a girl\'s best friend",
    x = "How Many Carats?",
    y = "Price (in $)"
  ) + theme(text = element_text(size = 20), 
            legend.position = c(0.9, 0.4), 
            legend.text = element_text(size = 5), 
            legend.title = element_text(size = 8),
            legend.key.height = unit(0.1, "in"))
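A version note on the legend position: starting with ggplot2 3.5.0, giving legend.position a pair of numbers is deprecated in favor of legend.position = "inside" plus a separate legend.position.inside argument. The sketch below picks whichever style your installed version expects. It uses mtcars and made-up coordinates, purely for illustration.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point() +
  labs(color = "Cylinders")

if (packageVersion("ggplot2") >= "3.5.0") {
  # Newer style: say "inside", then give the coordinates separately
  placed <- base_plot +
    theme(legend.position = "inside",
          legend.position.inside = c(0.9, 0.8))
} else {
  # Older style, as used in this unit
  placed <- base_plot + theme(legend.position = c(0.9, 0.8))
}
```

Type placed on its own line to draw the plot. Either way, the two numbers are proportions of the plot area, x first and then y.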

Step 7: Save the plot

If you’re using R Markdown (and you should be!), then you can skip this step - your plots will show up in your document automatically (after you run your code).

But if we wanted to save our beautiful graph to an image file, we would use ggsave().

We can specify the file type (by giving the file name an appropriate extension, like .png), width, height, and dots per inch.

ggsave("MyBeautifulPlot.png", width = 6, height = 4, units = "in", dpi = 300)
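By default, ggsave() saves the last plot that was displayed. If you have stored your plot under a name, you can pass it in explicitly with the plot argument, which is a bit safer. A minimal sketch, using mtcars and a temporary file so nothing lands in your project folder (the .pdf extension here is just my choice; .png works the same way):

```r
library(ggplot2)

my_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# tempfile() gives a throwaway path; the extension decides the file type
out_file <- tempfile(fileext = ".pdf")

# plot = my_plot saves that object rather than the last plot displayed
ggsave(out_file, plot = my_plot, width = 6, height = 4, units = "in")
file.exists(out_file)
```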

Before we make a new graph from scratch, let’s practice adjusting the aesthetics on an existing one.

  1. Run the code below. Now adjust the code so your graph looks like the one further below.
    • Note: The colors I used were red4 and blue4. The fills I used were red1 and blue1. The shapes I used were 21 and 22.
    • Also notice the change in the background color from gray to white.
# data.frame() keeps Checkpoint and Time numeric (cbind() would turn every column into text)
iditaroddata <- data.frame(
  Musher = rep(c("Jerry Sousa", "Melissa Owens"), each = 4),
  Checkpoint = rep(1:4, times = 2),
  Time = c(243, 176, 304, 201, 215, 421, 334, 220)
)

ggplot(data = iditaroddata, aes(x = Checkpoint, y = Time, group = Musher)) + 
  geom_line() +
  geom_point(size = 5)

Color Considerations

When making your graphs, keep the following in mind:

  • Not everyone can see all colors
  • AND many academic journals don’t print in color
    • Or they make you pay extra!
  • SO, make sure your graphs do not require color to be interpreted
  • This means that you should always have more than one way to differentiate two groups
    • e.g. color and shape/line type
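One way to follow this advice is to map the same variable to two aesthetics at once. This is only a sketch, using mtcars and a grayscale palette of my own choosing:

```r
library(ggplot2)

redundant_plot <- ggplot(mtcars, aes(x = wt, y = mpg,
                                     color = as.factor(cyl),
                                     shape = as.factor(cyl))) + # same variable, two cues
  geom_point(size = 3) +
  scale_color_grey(start = 0, end = 0.6) + # shades that survive black-and-white printing
  labs(color = "Cylinders", shape = "Cylinders") + # matching titles merge the two legends
  theme_bw()
```

Even if the shades of gray are hard to tell apart, the shapes still separate the groups.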

Color

  1. Give 2 reasons that a graph should not rely exclusively on color to show the difference between two groups.
  2. What can you do to make sure that your graph is interpretable even without color?

Making Different Types of Plots

I’ll be using data from the nycflights13 package for these demos. So, let’s load it.

if (!require("nycflights13")) install.packages("nycflights13")
library(nycflights13)

This package includes the flights data set, which covers all 336,776 flights that departed NYC in 2013. The date, time, scheduled time, carrier, origin, destination, air time, and distance traveled for each flight are included.

You know how to explore a data set by now! Take a moment to look at the flights data before moving on.

Histogram

Let’s start by making a histogram. We’ll look at every air traveler’s worst enemy: departure delays.

Steps 1-3: Set up the plot area and add a histogram

First we’ll set up ggplot, then add a histogram layer with geom_histogram().

In the code below, notice that no Y axis is specified. For histograms (and density plots), only the X axis is needed. The Y axis is computed from the data.

ggplot(data = flights, aes(x = dep_delay)) + geom_histogram() 

Step 4: customize our histogram

Let’s make our graph look better. We’ll make 3 changes:

  • The bins (individual bars) are too wide (ggplot defaults to 30 bins, however wide that makes them) - we want a more fine-grained representation
  • The color is bland
  • The inside of the bars is the same color as the outside - I don’t like this, personally.
ggplot(data = flights, aes(x = dep_delay)) + 
  geom_histogram(binwidth = 2, color = "darkblue", fill = "lightblue")

We don’t need any more layers, so we’ll skip step 5: add further layers.

Step 6: Finalize the plot

We need to further finalize the plot. There are 3 changes I want to make:

  • I want to zoom in the X axis - we have some extreme outliers here (some flights are really late).
  • Let’s add/customize our labels - those variable names are not adequate.
  • I don’t love the gray background of the plot area.
ggplot(data = flights, aes(x = dep_delay)) + 
  geom_histogram(binwidth = 2, color = "darkblue", fill = "lightblue") +
  coord_cartesian(xlim = c(-50, 200)) + # This zooms in the graph
  labs(
    x = "Departure Delay (minutes)",
    y = "How Many Flights?",
    title = expression(paste(bold("Figure 1."), " Histogram of Flight Departure Delays"))
  ) + theme_bw() 
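Why coord_cartesian() rather than xlim()? Both narrow the axis, but coord_cartesian() only zooms the view, while xlim() throws away the data outside the limits before anything is counted. Here is a quick sketch with made-up delays (the delays data frame is hypothetical); layer_data() returns the bin counts that ggplot computed.

```r
library(ggplot2)

# Hypothetical delays: 121 ordinary values plus two extreme outliers
delays <- data.frame(delay = c(seq(-20, 100, by = 1), 400, 600))

base_hist <- ggplot(delays, aes(x = delay)) + geom_histogram(binwidth = 5)

zoomed  <- base_hist + coord_cartesian(xlim = c(-50, 200)) # zooms the view only
clipped <- base_hist + xlim(-50, 200)                      # drops data outside the limits

n_zoomed  <- sum(layer_data(zoomed)$count)
n_clipped <- suppressWarnings(sum(layer_data(clipped)$count))
c(zoomed = n_zoomed, clipped = n_clipped)
```

For a histogram the difference is small, but for layers that compute summaries (like trendlines or box plots), silently dropping data can change the result.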

Density Plot

Next, let’s make a density plot. We’ll start with our histogram, but change the geom_histogram to geom_density:

ggplot(data = flights, aes(x = dep_delay)) + 
  geom_density() +
  coord_cartesian(xlim = c(-50, 200)) + # This zooms in the graph
  labs(
    x = "Departure Delay (minutes)",
    y = "Probability",
    title = expression(paste(bold("Figure 2."), " Density Plot of Flight Departure Delays"))
  ) + theme_bw() 

That’s OK, but we can do better. Let’s customize the density plot layer by:

  • Changing the line color
  • Changing the line type
  • Adding a fill
ggplot(data = flights, aes(x = dep_delay)) + 
  geom_density(color = "darkblue", linetype = 2, fill = "lightblue") +
  coord_cartesian(xlim = c(-50, 200)) + # This zooms in the graph
  labs(
    x = "Departure Delay (minutes)",
    y = "Probability",
    title = expression(paste(bold("Figure 2."), " Density Plot of Flight Departure Delays"))
  ) + theme_bw() 

Combining the Density Plot and Histogram

I love putting the density plots and histograms together. They’re highly compatible.

BUT, this takes a bit of work because the Y axes are different for these two plot types and we need to make them match.

Notice:

  • In the geom_histogram, the Y axis is defined as aes(y = after_stat(density)). In other words, we’re saying “use the density plot’s y axis scale instead.” (Older code writes this as ..density.., which recent versions of ggplot2 flag as deprecated.)
  • Putting the histogram lower in the code puts the bars on top of the density plot.
  • But, since I still want to see the density plot, I’ve made the bars somewhat transparent using alpha = 0.75. 0 is invisible, 1 is solid.
ggplot(data = flights, aes(x = dep_delay)) +
  geom_density(color = "black", linetype = 1, linewidth = 1) + # linewidth replaces size for lines in ggplot2 3.4.0+
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, color = "darkblue", fill = "lightblue", alpha = 0.75) + # We have to put these two plots on the same scale
  coord_cartesian(xlim = c(-50, 200)) + # This zooms in the graph
  labs(
    x = "Departure Delay (minutes)",
    y = "Probability",
    title = expression(paste(bold("Figure 3."), " Density Plot (and Histogram) of Flight Departure Delays"))
  ) + theme_bw() 

  1. Using the mtcars data (which is preloaded in R, so just say data = mtcars in ggplot), make the graph below.
    • Note: The histogram bin width is .2. The histogram color is "black" and the fill is "gray40". The density plot color is "black" and the fill is alpha("gray", 0.5).
    • Also note the X axis label.

Scatterplot

It’s scatterplot time!

In this graph, we’ll show the relationship between distance traveled and time in the air. This one takes longer to run, because there are many points to plot.

Note: google “ggplot shapes” to find the number codes for different shapes. Or go here: (http://sape.inf.usi.ch/quick-reference/ggplot2/shape) Only shapes 21-25 can take a fill aesthetic that is different from their color.

ggplot(data = flights, aes(y = distance, x = air_time)) +
  geom_point(color = "black", fill = "gray", shape = 21, size = 3) +
  labs(
    y = "Flight Distance (miles)",
    x = "Flight Time (minutes)",
    title = expression(paste(bold("Figure 7. "), "Scatter Plot of Flight Distance by Flight Duration"))
  ) + theme_bw()

But now, let’s add a trend line using geom_smooth.

We’ll add 2 arguments:

  • method = lm. This says we want to make a regression line.
  • formula = y ~ x. This shows the formula for our regression line. In this case, a straight line.
ggplot(data = flights, aes(y = distance, x = air_time)) +
  geom_point(color = "black", fill = "gray", shape = 21, size = 3) +
  geom_smooth(method = lm, formula = y ~ x) + 
  labs(
    y= "Flight Distance (miles)",
    x = "Flight Time (minutes)",
    title = expression(paste(bold("Figure 7. "), "Scatter Plot of Flight Distance by Flight Duration"))
  ) + theme_bw()
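If you are curious what that line actually is: with method = lm and formula = y ~ x, it is an ordinary least-squares regression line, the same one lm() would give you. A rough sketch using mtcars (not the flights data) to keep it fast:

```r
# Fit the same kind of straight line that geom_smooth(method = lm) draws
fit <- lm(mpg ~ wt, data = mtcars) # mtcars ships with R
coef(fit)                          # intercept and slope of the line

# A curved trendline would use a different formula, e.g. y ~ poly(x, 2)
curved_fit <- lm(mpg ~ poly(wt, 2), data = mtcars)
length(coef(curved_fit))           # three coefficients instead of two
```

So the formula argument controls the shape of the trendline: more terms in the formula, more bend allowed in the line.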

That’s a strong linear trend!

Let’s make the points and lines different for the different airports.

ggplot(data = flights, aes(y = distance, x = air_time)) +
  geom_point(aes(color = origin, fill = origin, shape = origin), size = 3) +
  geom_smooth(method = lm, formula = y ~ x, aes(color = origin)) + 
  labs(
    y= "Flight Distance (miles)",
    x = "Flight Time (minutes)",
    title = expression(paste(bold("Figure 7. "), "Scatter Plot of Flight Distance by Flight Duration"))
  ) + theme_bw()

Wouldn’t it be nice to specify what the colors, shapes, etc. should be? We can do that!

We can use a pre-built color palette, like these: (http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#palettes-color-brewer)

Or we can define our own, using these colors: (http://sape.inf.usi.ch/quick-reference/ggplot2/colour)

Instead of typing the whole graph code over and over, we’ll give the graph a name and then add to it, like so:

flightscatter <- ggplot(data = flights, aes(y = distance, x = air_time)) +
  geom_point(aes(color = origin, fill = origin, shape = origin), size = 3) +
  geom_smooth(method = lm, formula = y ~ x, aes(color = origin)) + 
  labs(
    y= "Flight Distance (miles)",
    x = "Flight Time (minutes)",
    title = expression(paste(bold("Figure 7. "), "Scatter Plot of Flight Distance by Flight Duration"))
  ) + theme_bw()  

flightscatter +
    scale_color_manual(values = c("coral4", "dodgerblue4", "grey22")) + 
    scale_fill_manual(values = c("coral1", "dodgerblue1", "grey")) +
    scale_shape_manual(values = c(21, 22, 24)) 

This allows you to mess around with colors without re-doing the whole graph each time.

Let’s also rename and move the legend. How did I do those two things?

flightscatter <- flightscatter +
    scale_color_manual(values = c("coral4", "dodgerblue4", "grey22")) + 
    scale_fill_manual(values = c("coral1", "dodgerblue1", "grey")) +
    scale_shape_manual(values = c(21, 22, 24)) +
    labs(color = "Airport", fill = "Airport", shape = "Airport") +
    theme(
     legend.position = c(0.10, 0.80) 
     # These numbers represent a proportion of the chart area, 
     # first the x coord, then the y
    )  

flightscatter

That looks pretty good. Now it’s your turn!

  1. Using the mtcars data, make this graph:
    • Note The point size is 5, color = “black”, fill = “green4”, shape = 21. Use the same color/fill for the line.

Faceting

Let’s discuss faceting for a moment. To facet a graph is to create multiple separate plot areas. The code below divides the plot into three plots, one for each of the 3 airports. They are distributed horizontally. You could switch the . and the origin to arrange the facets vertically instead. And you could add another variable in place of the . to make a grid of graphs. See also facet_wrap().

flightscatter + facet_grid(. ~ origin)
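The variations mentioned above look like this. It's a sketch on mtcars so it runs instantly; car_plot and the faceting variables (cyl, am) are just my choices for illustration.

```r
library(ggplot2)

car_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

side_by_side <- car_plot + facet_grid(. ~ cyl)         # one column per cyl value
stacked      <- car_plot + facet_grid(cyl ~ .)         # one row per cyl value
grid_of_both <- car_plot + facet_grid(am ~ cyl)        # rows = am, columns = cyl
wrapped      <- car_plot + facet_wrap(~ cyl, ncol = 2) # panels wrap onto new rows
```

Type any of these names to draw it. facet_wrap() is handy when one variable has many levels and a single row or column would get too cramped.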

Bar Graph

Let’s make a bar graph. This takes a bit more setup than the others, because we usually want to plot the means, not the raw data. And if we want to add standard error bars (we do!), we’ll need to compute the standard errors as well.

library(plotrix) # for the std.error function
flights_means <- flights %>%
  group_by(origin) %>%
  summarise(duration = mean(air_time, na.rm = TRUE), se = std.error(air_time))
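In case the standard error feels like a black box: it is just the standard deviation divided by the square root of the number of observations. Here is a hand-rolled sketch in base R (standard_error is my own helper name, not part of any package), applied to the built-in iris data:

```r
# Standard error of the mean: sd / sqrt(n), ignoring missing values
standard_error <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# Mean and standard error of sepal length for each species in iris
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
group_ses   <- tapply(iris$Sepal.Length, iris$Species, standard_error)
data.frame(species = names(group_means), mean = group_means, se = group_ses)
```

The error bars we are about to draw run from mean - se to mean + se.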

Now to the actual plot. Note the stat = “identity”. By default, geom_bar plots the count of the data points (i.e. how many). So, without stat = “identity”, the bar chart would show us how many flights there were from each airport. But we don’t want that right now. Instead, we want to show the mean, which is the value in the data, the identity of the number in the cell. So we say so. Get it?

We also want the error bars to extend from the means (duration) up and down one standard error. Hence aes(ymin=duration-se, ymax=duration+se).

ggplot(data = flights_means, aes(x = origin, y = duration)) +
  geom_bar(stat = "identity") + 
  geom_errorbar(aes(ymin=duration-se, ymax=duration+se),
                width=.2, # Width of the error bars
                position=position_dodge(.9))
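A shortcut worth knowing: geom_col() is ggplot2's shorthand for geom_bar(stat = "identity"). The sketch below uses a tiny made-up summary table (summary_data is hypothetical) to show that the two give the same bars.

```r
library(ggplot2)

# Hypothetical pre-computed means (made-up numbers for illustration)
summary_data <- data.frame(group = c("A", "B", "C"),
                           mean_score = c(3.2, 4.8, 4.1))

with_identity <- ggplot(summary_data, aes(x = group, y = mean_score)) +
  geom_bar(stat = "identity")

with_col <- ggplot(summary_data, aes(x = group, y = mean_score)) +
  geom_col() # same bars, less typing

layer_data(with_col)$y # bar heights are the values in the data
```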

I’ll worry about making this one pretty later.

  1. Using the mtcars data, make this graph:
    • Note You’ll need to summarize the mtcars data to get the means and standard errors for each cylinder, then pass this summarized data to ggplot. You’ll also need to use as.factor() to change cyl from a numeric variable to a factor.
    • Note Color = “black”, fill = default colors. The error bar width = 0.2
    • Note I took out the legend.

Box Plot & Violin Plot

Let’s move on to box plots. The first difference is that we need to define both an X and Y axis. We’ll plot flight distance (Y axis) by origin (i.e. what airport the flight left from; X axis).

ggplot(data = flights, aes(y = distance, x = origin)) + 
  geom_boxplot(color = "darkblue", linetype = 2, fill = "lightblue") +
  labs(
    y = "Flight Distance (miles)",
    x = "Airport",
    title = expression(paste(bold("Figure 4. "), "Box Plot of Flight Distance by Airport"))
  ) + theme_bw()

geom_violin works just like geom_boxplot, so to make a violin plot we’ll just change the geom name:

ggplot(data = flights, aes(y = distance, x = origin)) +
  geom_violin(color = "darkblue", linetype = 2, fill = "lightblue") +
  labs(
    y = "Flight Distance (miles)",
    x = "Airport",
    title = expression(paste(bold("Figure 5. "), "Violin Plot of Flight Distance by Airport"))
  ) + theme_bw()

To combine violin and box plots, we have to adjust their widths.

ggplot(data = flights, aes(y = distance, x = origin)) +
  geom_violin(color = "black", linetype = 1, fill = "gray", width = 1.4) +
  geom_boxplot(color = "darkblue", linetype = 2, fill = "lightblue", width = 0.02) +
  labs(
    y = "Flight Distance (miles)",
    x = "Airport",
    title = expression(paste(bold("Figure 5. "), "Violin Plot of Flight Distance by Airport with Box Plot"))
  ) + theme_bw()

Real Data: Do People Want to be Moral?

  1. Read in your file ‘moraldatalong.csv’
  2. Use this data to make the following graph:
    • Note Notice that the Traits are ordered from highest to lowest. You’ll have to figure out how to do this (hint: google “r reorder”). Also note that the fill colors for the traits are reordered, too, so you have to do the reordering 2 times in the code.

Real Data: My Data Will Go On

Now let’s try some more advanced graph-making, including facets and themes.

  1. Run this code to transform the pre-installed Titanic data into a long-form data set called TitanicData
TitanicData <- data.frame(Titanic)
  1. Now make the following graph:
    • Note Yes = gray15, No = red2, legend is at 0.6, 0.3. Also, use scales = “free_y” in the facet_grid.

  1. (Bonus) Alter your Titanic graph code from above to make the Figure Caption be APA style, like this:

  1. (Bonus) Alter your graph’s themes so it looks like this:
    • Note backgrounds are black, legend and strips are gray35, text is white and bold. panel grid lines (x axis) have been removed, as have minor y grid lines.

  1. (Bonus) Now put this picture into your graph:

https://upload.wikimedia.org/wikipedia/commons/4/4f/Titanic_the_sinking.jpg

Like so:
    • Hint: packages required include jpeg and ggimage

Some other Graph Types

Here are some other graph types you may need to make someday.

Line Graph

Let’s make a line graph. We’ll plot how departure delays change across the year.

We must do a bit of data preparation for this, since we want to plot the means, not the raw data.

library(plotrix) # for the std.error function
month_means <- flights %>%
  group_by(month, origin) %>%
  summarise(departure_delay = mean(dep_delay, na.rm = TRUE), se = std.error(dep_delay, na.rm = TRUE))

And now on to the plot. We’ll put some points in (with geom_point()), and connect those points with geom_line().

Other things I did (can you figure out which command does these things?):

  • Added error bars (easy!)
  • Formatted the x axis to show the month names instead of numbers
  • Turned the month names sideways and moved them around a bit
ggplot(data = month_means, aes(y = departure_delay, x = month)) +
  geom_line(aes(color = origin, linetype = origin)) + 
  geom_point(aes(color = origin, fill = origin, shape = origin), size = 3, alpha = 0.75) +
  geom_errorbar(aes(ymin=departure_delay-se, ymax=departure_delay+se),
                width=.2, # Width of the error bars
                position=position_dodge(.9)) +
  labs(
    y = "Departure Delay (minutes)",
    x = "Month",
    color = "Airport",
    linetype = "Airport",
    fill = "Airport",
    shape = "Airport",
    title = expression(paste(italic("Figure 7. "), "Line Plot of Flight Delay by Month"))
  ) + theme_bw() +
  scale_color_manual(values = c("coral4", "dodgerblue4", "grey22")) + 
  scale_fill_manual(values = c("coral1", "dodgerblue1", "grey0")) +
  scale_shape_manual(values = c(21, 22, 24)) + 
  scale_x_continuous(labels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"),
                     breaks = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
  ) + theme(axis.text.x = element_text(vjust = 0.25, hjust = 1, angle = 90),
            legend.position = c(0.78, 0.8))

Bubble Plot

A bubble plot is really just a scatterplot, except that the size of the points varies by some 3rd variable.

library(plotrix) # for the std.error function
flights_means <- filter(flights, carrier %in% c("AA", "B6", "DL", "EV", "MQ", "UA", "US")) %>%
  group_by(origin, carrier) %>%
  summarise(duration = mean(air_time, na.rm = TRUE), departure_delay = mean(dep_delay, na.rm = TRUE), se = std.error(air_time, na.rm = TRUE))

ggplot(data = flights_means, aes(y =carrier, x = origin)) +
  geom_point(aes(size = departure_delay), color = "blue", shape = 21, fill = "lightblue") +
  labs(
    y= "Airline",
    x = "Airport",
    size = "Departure Delay",
    title = expression(paste(italic("Figure 9. "), "Bubble Plot of Departure Delay"))
  ) + theme_bw()
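The example above is the grid form of a bubble plot. For the scatterplot form, where both axes are continuous, something like the following sketch would work. It uses mtcars, and the scale_size_area() call is my own addition: it makes the area of each bubble (rather than its radius) proportional to the value.

```r
library(ggplot2)

bubble <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(size = hp), shape = 21, color = "blue",
             fill = "lightblue", alpha = 0.75) +
  scale_size_area(max_size = 10) + # bubble area tracks hp
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon", size = "Horsepower") +
  theme_bw()
```

Type bubble to draw it. Note that the legend title is set with size = in labs(), because size is the aesthetic that was mapped.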

Heatmaps

To get a grid-like heatmap, use geom_tile() or geom_raster()

ggplot(data = flights_means, aes(y =carrier, x = origin)) +
  geom_raster(aes(fill = departure_delay)) +
  scale_fill_gradient(low="black", high="red") +
  labs(
    y= "Airline",
    x = "Airport",
    fill = "Departure Delay",
    title = expression(paste(italic("Figure 10. "), "Heat Map of Departure Delay"))
  ) + theme_bw()
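And here is the geom_tile() version, as a self-contained sketch. The grid_data values are made up for illustration. One practical difference: geom_tile() lets you draw a border around each cell with color.

```r
library(ggplot2)

# Hypothetical grid: every combination of two categorical variables
grid_data <- expand.grid(airport = c("EWR", "JFK", "LGA"),
                         airline = c("AA", "DL", "UA"))
grid_data$delay <- c(10, 12, 9, 8, 11, 7, 13, 6, 10) # made-up values

tile_map <- ggplot(grid_data, aes(x = airport, y = airline)) +
  geom_tile(aes(fill = delay), color = "white") + # white borders between cells
  scale_fill_gradient(low = "black", high = "red") +
  labs(x = "Airport", y = "Airline", fill = "Delay (minutes)") +
  theme_bw()
```

Type tile_map to draw it.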

Now you’re ready to make some graphs of your own!

  1. (Bonus) Read in ‘marriagedata.csv’ from Checklist 4. Make a line graph showing change in marital satisfaction over time. Include separate lines for husband and wife, and points with error bars in addition to the lines.